Southern Region
MAD-Fact: A Multi-Agent Debate Framework for Long-Form Factuality Evaluation in LLMs
Ning, Yucheng, Lin, Xixun, Fang, Fang, Cao, Yanan
The widespread adoption of Large Language Models (LLMs) raises critical concerns about the factual accuracy of their outputs, especially in high-risk domains such as biomedicine, law, and education. Existing evaluation methods for short texts often fail on long-form content due to complex reasoning chains, intertwined perspectives, and cumulative information. To address this, we propose a systematic approach integrating large-scale long-form datasets, multi-agent verification mechanisms, and weighted evaluation metrics. We construct LongHalluQA, a Chinese long-form factuality dataset, and develop MAD-Fact, a debate-based multi-agent verification system. We introduce a fact importance hierarchy to capture the varying significance of claims in long-form texts. Experiments on two benchmarks show that larger LLMs generally maintain higher factual consistency, while domestic models excel on Chinese content. Our work provides a structured framework for evaluating and enhancing factual reliability in long-form LLM outputs, guiding their safe deployment in sensitive domains.
- Asia > China > Beijing > Beijing (0.04)
- Europe > Middle East > Malta > Southern Region > Southern Harbour District > Luqa (0.04)
Cheems: A Practical Guidance for Building and Evaluating Chinese Reward Models from Scratch
Wen, Xueru, Lou, Jie, Li, Zichao, Lu, Yaojie, Yu, Xing, Ji, Yuqiu, Xu, Guohai, Lin, Hongyu, He, Ben, Han, Xianpei, Sun, Le, Zhang, Debing
Reward models (RMs) are crucial for aligning large language models (LLMs) with human preferences. However, most RM research is centered on English and relies heavily on synthetic resources, which leads to limited and less reliable datasets and benchmarks for Chinese. To address this gap, we introduce CheemsBench, a fully human-annotated RM evaluation benchmark within Chinese contexts, and CheemsPreference, a large-scale and diverse preference dataset annotated through human-machine collaboration to support Chinese RM training. We systematically evaluate open-source discriminative and generative RMs on CheemsBench and observe significant limitations in their ability to capture human preferences in Chinese scenarios. Additionally, based on CheemsPreference, we construct an RM that achieves state-of-the-art performance on CheemsBench, demonstrating the necessity of human supervision in RM training. Our findings reveal that scaled AI-generated data struggles to fully capture human preferences, emphasizing the importance of high-quality human supervision in RM development.
- North America > United States > California > San Francisco County > San Francisco (0.14)
- Asia > China > Beijing > Beijing (0.04)
- Europe > Middle East > Malta > Southern Region > Southern Harbour District > Luqa (0.04)
- Research Report > New Finding (0.48)
- Instructional Material > Training Manual (0.40)
The Best of Both Worlds: a Framework for Combining Degradation Prediction with High Performance Super-Resolution Networks
Aquilina, Matthew, Ciantar, Keith George, Galea, Christian, Camilleri, Kenneth P., Farrugia, Reuben A., Abela, John
To date, the best-performing blind super-resolution (SR) techniques follow one of two paradigms: A) generate and train a standard SR network on synthetic low-resolution/high-resolution (LR/HR) pairs or B) attempt to predict the degradations an LR image has suffered and use these to inform a customised SR network. Despite significant progress, subscribers to the former miss out on useful degradation information that could be used to improve the SR process. On the other hand, followers of the latter rely on weaker SR networks, which are significantly outperformed by the latest architectural advancements. In this work, we present a framework for combining any blind SR prediction mechanism with any deep SR network, using a metadata insertion block to insert prediction vectors into SR network feature maps. Through comprehensive testing, we prove that state-of-the-art contrastive and iterative prediction schemes can be successfully combined with high-performance SR networks such as RCAN and HAN within our framework. We show that our hybrid models consistently achieve stronger SR performance than both their non-blind and blind counterparts. Furthermore, we demonstrate our framework's robustness by predicting degradations and super-resolving images from a complex pipeline of blurring, noise and compression.
- Europe > Middle East > Malta > Eastern Region > Northern Harbour District > Msida (0.04)
- North America > United States (0.04)
- Europe > United Kingdom > Scotland > City of Edinburgh > Edinburgh (0.04)
- Overview (0.92)
- Research Report > New Finding (0.67)
Malta: The Innovation Island AIBC Summit
If you look at the past four years, we've been enjoying substantial economic growth. For the economy to be resilient to external shocks, we have to continue to diversify and explore new niches to sustain that growth. For this reason we have delved into niche economic areas. We started two years ago with blockchain, and we've been attracting significant investment to our island, not only in crypto but also in other areas of technological development. Now we're seeing companies that invest in technology coming here to work and operate from Malta. We're also seeing a spill-over effect, such as companies from the iGaming industry producing new products supported by blockchain technology.
- Europe > Middle East > Malta > Southern Region > Southern Harbour District > Luqa (0.05)
- Europe > France > Provence-Alpes-Côte d'Azur > Bouches-du-Rhône > Marseille (0.05)
- Africa > Middle East > Libya (0.05)
- Information Technology (1.00)
- Government (0.97)
- Law (0.71)